Introduction

The goal of this project is to construct models that predict violent recidivism, and recidivism in general, in Broward County, FL. The project occurs against the backdrop of recent research by ProPublica on algorithmic bias. When ProPublica analyzed COMPAS, a widely used risk assessment that scores individuals in the criminal justice system on their likelihood of recidivating, the organization found evidence of racial bias: African-Americans were more likely to be misclassified as high risk, while Caucasian individuals were more likely to be misclassified in the opposite direction.

Within this context, our group obtained the same dataset used by ProPublica to build our own models and compare results.

This report proceeds in five key sections: data wrangling, exploratory data analysis, variable selection, model comparison, and analysis of our models against COMPAS.

Because the primary task of this project centers on model design and performance, we put substantial effort into exploring, tuning, and analyzing a variety of classification methods, including additive logistic regressions, discriminant analyses, KNN, and decision trees. For each method, we produced a confusion matrix and an ROC object, from which we could extract key performance metrics. We did not enter this project with strong priors about which method would be best, but rather assumed different methods would trade off differently across key performance metrics.

Across most methods, the strongest predictors of recidivism were age, sex, and prior involvement with the criminal justice system.

At the end of our analysis, we found that simpler models often performed on par with complex ones. A pruned tree worked best for general recidivism, while a linear discriminant analysis performed admirably on the violent recidivism outcome. When comparing our highest-performing models against COMPAS's reported performance metrics, we were often able to achieve similar sensitivity with greater specificity. This was especially marked when we examined how our models treated individuals of different sexes and races.

Before running, make sure the following files are in the current directory:

  • violent_ROC_plot.RData
  • performance.metrics.violent.RData
  • df.twoyear.violent.RData

Data Wrangling

Reading In Files

wd = getwd()

First, we imported the relevant libraries and read in the file containing data on two-year recidivism.

# All below from pro-publica
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(prettydoc)
library(ggplot2)
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 3.5.3
library(leaps)
## Warning: package 'leaps' was built under R version 3.5.3
library(corrplot)
## corrplot 0.84 loaded
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(pROC)
## Warning: package 'pROC' was built under R version 3.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
### READ IN compas-scores-two-years.csv ###
raw_data <- read.csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv")

## Violent - ran prior, saved key objects, and load when needed ## 
# raw_data <- read.csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years-violent.csv")

# Setting seed for consistency
set.seed(123)

### CREATE DATAFRAME AND FILTER AS DESIRED ###
df <- dplyr::select(raw_data, age, c_charge_degree, race, 
                    age_cat, score_text, sex, priors_count, 
                    days_b_screening_arrest, decile_score, 
                    is_recid, two_year_recid, c_jail_in, c_jail_out) %>% 
  dplyr::filter(days_b_screening_arrest <= 30, days_b_screening_arrest >= -30,
         is_recid != -1, c_charge_degree != "O", score_text != 'N/A') %>% 
  dplyr::select(-days_b_screening_arrest)
nrow(df)
## [1] 6172

We repeated this process for the dataframe associated with violent recidivism.

vraw_data <- read.csv("https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years-violent.csv")

library(dplyr)
vdf <- dplyr::select(vraw_data, age, c_charge_degree, race, 
                    age_cat, score_text, sex, priors_count, 
                    days_b_screening_arrest, decile_score, 
                    is_recid, two_year_recid, c_jail_in, c_jail_out) %>% 
  filter(days_b_screening_arrest <= 30, days_b_screening_arrest >= -30,
         is_recid != -1, c_charge_degree != "O", score_text != 'N/A')%>% 
  dplyr::select(-days_b_screening_arrest)
nrow(vdf)
## [1] 4015

Dummies and Additional Features

Below, we undertook some basic data wrangling to ensure we had dummies when necessary for the classification tasks. We also ensured column names were in a form recognizable to R. We did this for both the violent and nonviolent datasets.

library(fastDummies)
## Warning: package 'fastDummies' was built under R version 3.5.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dummy_col_names = c("c_charge_degree","race","age_cat","score_text","sex")
df_dummies <- fastDummies::dummy_cols(.data=df,select_columns=dummy_col_names)
df_dummies <-  df_dummies %>% dplyr:: select(-dummy_col_names)

### fix column names (hyphens and spaces) that break formulas and caret ###
names(df_dummies)[which(names(df_dummies)=="race_African-American")] = "race_African_American"
names(df_dummies)[which(names(df_dummies)=="race_Native American")] = "race_Native_American"
colnames(df_dummies)
##  [1] "age"                     "priors_count"           
##  [3] "decile_score"            "is_recid"               
##  [5] "two_year_recid"          "c_jail_in"              
##  [7] "c_jail_out"              "c_charge_degree_F"      
##  [9] "c_charge_degree_M"       "race_Other"             
## [11] "race_African_American"   "race_Caucasian"         
## [13] "race_Hispanic"           "race_Asian"             
## [15] "race_Native_American"    "age_cat_Greater than 45"
## [17] "age_cat_25 - 45"         "age_cat_Less than 25"   
## [19] "score_text_Low"          "score_text_Medium"      
## [21] "score_text_High"         "sex_Male"               
## [23] "sex_Female"
#Violent
#Now With Violence
vdummy_col_names = c("c_charge_degree","race","age_cat","score_text","sex")
vdf_dummies <- fastDummies::dummy_cols(.data=vdf,select_columns=vdummy_col_names)
vdf_dummies <-  vdf_dummies %>% dplyr:: select(-vdummy_col_names)

names(vdf_dummies)[which(names(vdf_dummies)=="race_African-American")] = "race_African_American"
names(vdf_dummies)[which(names(vdf_dummies)=="race_Native American")] = "race_Native_American"
colnames(vdf_dummies)
##  [1] "age"                     "priors_count"           
##  [3] "decile_score"            "is_recid"               
##  [5] "two_year_recid"          "c_jail_in"              
##  [7] "c_jail_out"              "c_charge_degree_F"      
##  [9] "c_charge_degree_M"       "race_Other"             
## [11] "race_African_American"   "race_Caucasian"         
## [13] "race_Hispanic"           "race_Asian"             
## [15] "race_Native_American"    "age_cat_Greater than 45"
## [17] "age_cat_25 - 45"         "age_cat_Less than 25"   
## [19] "score_text_Low"          "score_text_High"        
## [21] "score_text_Medium"       "sex_Male"               
## [23] "sex_Female"
# From propublica
df_dummies$length_of_stay <- as.numeric(as.Date(df$c_jail_out) - as.Date(df$c_jail_in))
df_dummies <- df_dummies %>% dplyr::select(-c("c_jail_out", "c_jail_in"))
cor(df_dummies$length_of_stay, df_dummies$decile_score)
## [1] 0.2073297
#Violent
vdf_dummies$length_of_stay <- as.numeric(as.Date(vdf$c_jail_out) - as.Date(vdf$c_jail_in))
vdf_dummies <- vdf_dummies %>% dplyr::select(-c("c_jail_out", "c_jail_in"))
cor(vdf_dummies$length_of_stay, vdf_dummies$decile_score)
## [1] 0.2093402
#Additional Cleanup; Versions without Dummies
df.twoyear <- df %>% dplyr::select(-c("age_cat", "score_text", "decile_score", "is_recid",
                               "c_jail_in", "c_jail_out"))
df.twoyear$two_year_recid <- as.factor(df.twoyear$two_year_recid)
df.twoyear$length_of_stay <- as.numeric(as.Date(df$c_jail_out) - as.Date(df$c_jail_in))

Exploratory Data Analysis

In the sections below, we sought to gain some understanding of the underlying data. In particular, we were interested in the relationships between two-year recidivism and the features available to us in the data. We also include ProPublica's visualization of the disparities in decile scores, because it was helpful as we thought about the structure of the problem.

# From ProPublica
summary(df$age_cat)
##         25 - 45 Greater than 45    Less than 25 
##            3532            1293            1347
xtabs(~ sex + race, data=df)
##         race
## sex      African-American Asian Caucasian Hispanic Native American Other
##   Female              549     2       482       82               2    58
##   Male               2626    29      1621      427               9   285
summary(df$sex)
## Female   Male 
##   1175   4997
nrow(filter(df, two_year_recid == 1))
## [1] 2809
nrow(filter(df, two_year_recid == 1)) / nrow(df) * 100
## [1] 45.51199
#Violent
summary(vdf$age_cat)
##         25 - 45 Greater than 45    Less than 25 
##            2299             950             766
xtabs(~ sex + race, data=vdf)
##         race
## sex      African-American Asian Caucasian Hispanic Native American Other
##   Female              393     1       335       61               0    49
##   Male               1525    25      1120      294               7   205
summary(vdf$sex)
## Female   Male 
##    839   3176
nrow(filter(vdf, two_year_recid == 1))
## [1] 652
nrow(filter(vdf, two_year_recid == 1)) / nrow(vdf) * 100
## [1] 16.2391
# From ProPublica
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggplot2)

pblack <- ggplot(data=filter(df, race =="African-American"), aes(ordered(decile_score))) + 
          geom_bar() + xlab("Decile Score") +
          ylim(0, 650) + ggtitle("Black Defendant's Decile Scores")
pwhite <- ggplot(data=filter(df, race =="Caucasian"), aes(ordered(decile_score))) + 
          geom_bar() + xlab("Decile Score") +
          ylim(0, 650) + ggtitle("White Defendant's Decile Scores")
grid.arrange(pblack, pwhite,  ncol = 2)

The ProPublica analysis showed clear, disparate patterns in defendants' scores. Notably, the distribution of decile scores appeared nearly uniform for Black defendants, yet was heavily right-skewed (concentrated at low scores) for white defendants. We sought more granular information about the types of individuals who tend to recidivate, and produced a series of visuals and histograms showing the actual, observed two-year recidivism across groups.

In general, we saw that defendants tended to be young, with an age distribution that is strongly right-skewed: there is a peak around age 25, after which there is a gradual decline. When overlaying this data with the proportion of each age that recidivated, we observed a similar overall shape. The recidivism patterns, as a proportion of each age bin, were starkly different, however: younger individuals recidivated at nearly even proportions, while this rate fell as age increased.

library(ggplot2)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(scipen = 4)
#Relationship Between Age, Race, Sex, and Priors and Recidivism

ggplot(df, aes(x=age)) + geom_histogram(aes(y=..density..), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") + labs(title="Composition of Arrests by Age", y="Frequency", x="Age") + geom_vline(aes(xintercept=mean(age)),
            color="blue", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=df, mapping=aes(x=age, fill=factor(two_year_recid))) + geom_bar() + labs(title="Age Breakdown by Two Year Recidivism")

ggplot(data=df, mapping=aes(x=race, fill=factor(two_year_recid))) + geom_bar() + labs(title="Race Breakdown by Two Year Recidivism")

ggplot(data=df, mapping=aes(x=priors_count, fill=factor(two_year_recid))) + geom_bar() + labs(title="Number of Priors") 

ggplot(data=df, mapping=aes(x=sex, fill= factor(two_year_recid))) + geom_bar() + labs(title="Sex Breakdown by two-year recid")

We also examined breakdowns by race and by sex. The majority of defendants were male, and males and females recidivated at similar rates relative to their group sizes. In the race breakdown, African-Americans were heavily represented in the system, making up a plurality of defendants; the second most populous group was Caucasian, with smaller representation of Hispanic, Asian, and Other defendants.

Because we assumed the past could be an indicator of the present, we also mapped out the number of priors. We observed that the majority of individuals were “first-timers”: the bulk of our dataset had fewer than two previous encounters with the criminal justice system. However, as the number of priors increased, the proportion who recidivated within two years also increased, giving us some indication that prior criminal activity is positively correlated with the likelihood of recidivating within two years.
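As a quick sanity check on this pattern, the observed recidivism rate can be tabulated by bins of priors_count directly from the dataframe. The sketch below does this with dplyr; the bin boundaries are illustrative choices, not part of the original analysis.

# Sketch: observed two-year recidivism rate by (illustrative) bins of priors_count
df %>%
  mutate(priors_bin = cut(priors_count, breaks = c(-Inf, 0, 2, 5, 10, Inf))) %>%
  group_by(priors_bin) %>%
  summarise(n = n(), recid_rate = mean(two_year_recid))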

Correlations

We were interested in how these features interacted with each other, so we created correlation plots with both the violent and general recidivism data. The plots are shown below.

library(corrplot)
## corrplot 0.84 loaded
# reorder so outputs of interest are at the beginning
df_dummies <- df_dummies %>% dplyr::select("two_year_recid", "is_recid", "decile_score", 
                                    everything())
correlations <- cor(df_dummies, use = "pairwise.complete.obs")

corrplot(correlations, diag = FALSE, type = "upper", tl.cex = 0.6)

#For Violent
vdf_dummies <- vdf_dummies %>% dplyr::select("two_year_recid", "is_recid", "decile_score", 
                                     everything())
vcorrelations <- cor(vdf_dummies, use = "pairwise.complete.obs")
corrplot(vcorrelations, diag = FALSE, type = "upper", tl.cex = 0.6)

The first thing we noticed is that several of the predictors were perfectly, or almost perfectly, collinear. In particular, the dummy indicators derived from the same underlying variable (age category, score, sex, and race) were strongly correlated with one another. Because these variables were essentially capturing the same phenomenon, we used this as guidance to eliminate redundant variables.

Beyond the collinear variables, a few other trends emerged. In particular, being African-American was strongly, negatively correlated with receiving a low score. The scores themselves were correlated with recidivism in the direction we might expect, but only moderately: low scores were negatively correlated with recidivism, and high scores were positively correlated with it. This is important, because it suggests the COMPAS scores do pick up some meaningful behavioral differences between individuals who are likely (or unlikely) to recidivate. However, the magnitude of this correlation was not particularly strong, and we were curious how the score weights the features available in our data.
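To make the collinearity discussion concrete, the most strongly correlated pairs can be pulled directly from the correlation matrix computed above. The 0.9 threshold below is an arbitrary illustrative cutoff, not a value used elsewhere in this report.

# Sketch: list predictor pairs with |correlation| above an illustrative 0.9 threshold
high.cor <- which(abs(correlations) > 0.9 & upper.tri(correlations), arr.ind = TRUE)
data.frame(var1 = rownames(correlations)[high.cor[, 1]],
           var2 = colnames(correlations)[high.cor[, 2]],
           corr = round(correlations[high.cor], 3))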

Variable Selection

Before running models, we wanted to observe which variables were most important. We ran exhaustive best-subset searches using regsubsets on both datasets. We were also curious which factors predicted the COMPAS scoring mechanism, so we included the decile score as an outcome as well.

library(leaps)
## Warning: package 'leaps' was built under R version 3.5.3
# assuming we can't know is_recid as a predictor at time of arrest
# must get rid of highly collinear vars, non predictive vars, otherwise redundant vars
twoyear.predictors <- df_dummies %>% dplyr::select(-c("is_recid", "decile_score", 
                                               "score_text_Low", "score_text_Medium",
                                               "score_text_High", "c_charge_degree_M",
                                               "race_Caucasian", "age_cat_Greater than 45",
                                               "age_cat_25 - 45", "age_cat_Less than 25",
                                               "sex_Female"))
twoyear.predictors$two_year_recid <- as.factor(twoyear.predictors$two_year_recid)


score.predictors <- df_dummies %>% dplyr::select(-c("is_recid", "two_year_recid", 
                                               "score_text_Low", "score_text_Medium",
                                               "score_text_High", "c_charge_degree_M",
                                               "race_Caucasian", "age_cat_Greater than 45",
                                               "age_cat_25 - 45", "age_cat_Less than 25",
                                               "sex_Female"))
violent_recid.predictors <- vdf_dummies %>% dplyr::select(-c("is_recid", "decile_score", 
                                               "score_text_Low", "score_text_Medium",
                                               "score_text_High", "c_charge_degree_M",
                                               "race_Caucasian", "age_cat_Greater than 45",
                                               "age_cat_25 - 45", "age_cat_Less than 25",
                                               "sex_Female"))
# Getting variables for two year recid
subset.twoyear <- regsubsets(two_year_recid ~ .,
                             data = twoyear.predictors,
                             nbest = 1,
                             nvmax = 12,
                             method = "exhaustive")


# Getting variables for score decile
subset.score <- regsubsets(decile_score ~ .,
                           data = score.predictors,
                           nbest = 1,
                           nvmax = 12,
                           method = "exhaustive")
# Getting variables for violent recid
subset.violent_recid <- regsubsets(two_year_recid ~ .,
                             data = violent_recid.predictors,
                             nbest = 1,
                             nvmax = 12,
                             method = "exhaustive")

After obtaining the top predictors for each model size, we collected all of the results into a single dataframe.

twoyear.sum <- summary(subset.twoyear)
predictors <- colnames(twoyear.sum$which[,2:length(colnames(twoyear.sum$which))])

twoyear.appearances <- sapply(as.data.frame(twoyear.sum$which), sum)
twoyear.appearances <- twoyear.appearances[2:length(twoyear.appearances)]

score.sum <- summary(subset.score)
score.appearances <- sapply(as.data.frame(score.sum$which), sum)
score.appearances <- score.appearances[2:length(score.appearances)]

violent_recid.sum <- summary(subset.violent_recid)
violent_recid.appearances <- sapply(as.data.frame(violent_recid.sum$which), sum)
violent_recid.appearances <- violent_recid.appearances[2:length(violent_recid.appearances)]

appearance.df <- data.frame("predictor" = names(score.appearances), 
                            "appearances" = score.appearances,
                            "metric" = rep("score", length(names(score.appearances))))

appearance.df <- rbind(appearance.df,
                       data.frame("predictor" = names(twoyear.appearances), 
                                  "appearances" = twoyear.appearances,
                                  "metric" = rep("two.year", 
                                                 length(names(twoyear.appearances)))))

appearance.df <- rbind(appearance.df,
                       data.frame("predictor" = names(violent_recid.appearances), 
                                  "appearances" = violent_recid.appearances,
                                  "metric" = rep("violent_recid", 
                                                 length(names(violent_recid.appearances)))))

rownames(appearance.df) <- c()

Regsubsets Analysis

Upon running exhaustive regsubsets searches on both datasets, we observed that the subset selection procedure consistently chose age, priors count, and sex as the strongest predictors. When we overlaid this with the top features reported for the COMPAS models, we found strongly conflicting information about how the features are weighted.

ggplot(data = appearance.df) +
  geom_bar(mapping = aes(x = predictor, y = appearances, fill = metric), 
           position = "dodge", stat = "identity")+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))+
  ggtitle("Appearences of predictors in different regsubset models")

Model Comparison

Below, we created a common training and test set using an 80/20 split of our data, to be used across a slew of modeling options. Importantly, these training and test sets were randomized once and held identical across all initial model tests.

smp_size <- floor(0.80 * nrow(df))
train_ind <- sample(1:(nrow(df)), size = smp_size)

train <- df.twoyear[train_ind, ]
test <- df.twoyear[-train_ind, ]

x_train <- train[,-6]
y_train <- train$two_year_recid
x_test <- test[,-6]
y_test <- test$two_year_recid

#---------------Unique Test and Train Set for Level-Requiring formulas
pract <- df.twoyear
levels <- unique(pract$two_year_recid) 
pract$two_year_recid <- factor(pract$two_year_recid, labels=make.names(levels))

ctrain <- pract[train_ind, ]
ctest <- pract[-train_ind, ]

cx_train <- ctrain[,-6]
cy_train <- ctrain$two_year_recid
cx_test <- ctest[,-6]
cy_test <- ctest$two_year_recid

Discriminant Analysis

First, we ran a linear discriminant analysis and quadratic discriminant analysis without including tuning parameters.

library(klaR)
library(caret)
## Loading required package: lattice
#Traditional - Default Settings w/ No tuning parameters
  lda_fit <- lda(two_year_recid ~., data=train)
  qda_fit <- qda(two_year_recid ~., data=train)
  lda_fit$prior #Prevalence
##         0         1 
## 0.5428398 0.4571602
  lda_predict <- predict(lda_fit, test, type="response")
  qda_predict <- predict(qda_fit, test)
  qda_pred <- as.numeric(qda_predict$class)
  lda_fit$means
##        age c_charge_degreeM   raceAsian raceCaucasian raceHispanic
## 0 36.56455        0.4041045 0.006716418     0.3798507   0.09440299
## 1 32.07709        0.2924236 0.003544528     0.2888790   0.06778910
##   raceNative American  raceOther   sexMale priors_count length_of_stay
## 0         0.001865672 0.06716418 0.7798507     1.997015       10.11642
## 1         0.001772264 0.04607887 0.8577758     4.711121       20.58751
#LDA / QDA Plots with Best Variables from Lasso
#Images Show Binary classifications make it unclear where distinctions should rest
lda_plots <- partimat(two_year_recid ~ priors_count + c_charge_degree + length_of_stay + sex, method="lda", data=df.twoyear)

qda_plots <- partimat(two_year_recid ~ priors_count + c_charge_degree + length_of_stay + sex, method="qda", data=df.twoyear)

lda_cm <- caret::confusionMatrix(positive = "1", lda_predict$class, test$two_year_recid)
qda_cm <- caret::confusionMatrix(positive = "1", qda_predict$class, test$two_year_recid)

bin <- as.integer(y_test)
lda_bin <- as.integer(lda_predict$class)
qda_bin <- as.integer(qda_predict$class)

#lda_roc(lda_fit)
lda_roc <-roc(bin, lda_predict$posterior[,2])
qda_roc <- roc(bin, qda_predict$posterior[,2])

We found it useful to visualize what a rather simple model could do at its best. When we mapped out the decision boundaries produced by either method, we immediately saw that this style of classification might not be ideal here, given how diffusely individuals of both classes are scattered on either side of the decision boundary.

Below, we tuned the LDA function with built-in feature selection tools.

#Stepwise LDA with built-in feature selection
train_control = trainControl(method="CV", number=10, classProbs=TRUE)

lda_stepwise <- train(cx_train,cy_train,  
                                method = "stepLDA", 
                                trControl = train_control,
                                metric = "ROC")
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63359;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        2.27
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63111;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##         0.0         0.0         1.6
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4444 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63501;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        2.12
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63178;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##         0.0         0.0         1.5
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63584;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        1.64
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.62683;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        1.66
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4444 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63255;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        1.67
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4444 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.62781;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        1.48
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63155;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        1.36
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4443 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.62953;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##        0.00        0.00        1.38
##  `stepwise classification', using 10-fold cross-validated correctness rate of method lda'.
## 4937 observations of 6 variables in 2 classes; direction: both
## stop criterion: improvement less than 5%.
## correctness rate: 0.63156;  in: "priors_count";  variables (1): priors_count 
## 
##  hr.elapsed min.elapsed sec.elapsed 
##         0.0         0.0         1.3
best_lda <- lda_stepwise[["finalModel"]][["formula"]]
print(best_lda)
## y ~ priors_count
## <environment: 0x0000000005e2e2c0>
best_lda <-lda(two_year_recid ~priors_count, data=ctest)
best_lda_pred <- predict(best_lda, ctest, type="raw")
print(best_lda) 
## Call:
## lda(two_year_recid ~ priors_count, data = ctest)
## 
## Prior probabilities of groups:
##        X0        X1 
## 0.5530364 0.4469636 
## 
## Group means:
##    priors_count
## X0     1.945827
## X1     4.932971
## 
## Coefficients of linear discriminants:
##                    LD1
## priors_count 0.2201716
#Confusion Matrix
pred = best_lda_pred[["class"]]
pred = plyr::revalue(pred, c("X0"="0", "X1"="1"))
temp_test = plyr::revalue(cy_test, c("X0"="0", "X1"="1"))
lda_stepwise_cm <- caret::confusionMatrix(pred, temp_test,positive='1')

#ROCstep_bin, test_bin
lda_stepwise_roc <-roc(cy_test, best_lda_pred$posterior[,2])
print(lda_stepwise_roc)
## 
## Call:
## roc.default(response = cy_test, predictor = best_lda_pred$posterior[,     2])
## 
## Data: best_lda_pred$posterior[, 2] in 683 controls (cy_test X0) < 552 cases (cy_test X1).
## Area under the curve: 0.6884

Cite: https://github.com/topepo/caret/blob/master/RegressionTests/Code/stepLDA.R

Naive Bayes

For good measure, and for simplicity, we also ran a Naive Bayes classifier on the dataset. Ultimately, Naive Bayes, LDA, and QDA all performed only moderately well.

library(caret)
library(klaR)
#Naive Bayes
    #Traditional
nb_fit <- NaiveBayes(two_year_recid ~ priors_count + c_charge_degree + length_of_stay + sex, data=train, usekernel=TRUE)
nb_predict <- predict(nb_fit, test, type="response")
nb_cm <- confusionMatrix(positive = "1", nb_predict$class, test$two_year_recid)
  
    #Tuning
nb_pred <- rep(0, nrow(test))
nb_pred[nb_predict$posterior[,2] > 0.5] = 1
    
nb_roc <- roc(test$two_year_recid, nb_predict$posterior[,1]) 
library(pROC)
## Warning: package 'pROC' was built under R version 3.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
ggroc(list(NB = nb_roc, LDA = lda_roc, QDA=qda_roc, LDA_Step = lda_stepwise_roc)) + theme_light() + ggtitle("Discriminant Analysis / NB ROC Curves")

LDA appeared to be the best classifier among these variants, and this also held true on the violent recidivism dataset.
We next examined how well additive models could do.

Logistic Regression + Additive Models

Several logistic regressions were developed using the Caret package. First, we used Recursive Feature Elimination (RFE) with 10-fold cross-validation to evaluate predictor subsets of every size for a logistic regression. RFE is Caret's backwards-selection analogue of best subset selection, and it supports a wider range of models than the linear-regression subset functions. RFE produced a logistic model with the features shown below. The 1-SE rule was used to trade a small amount of accuracy for reduced complexity.

library(caret)
# Model creation using caret 
train_control = trainControl(method="CV", number=10)

### LOGISTIC REGRESSION BEST SUBSET SELECTION USING CARET BACKWARDS FEATURE SELECTION (rfe- recursive feature elimination) METHODS ### 
### http://topepo.github.io/caret/recursive-feature-elimination.html#rfe ###
x = twoyear.predictors %>% dplyr::select(-c("two_year_recid"))
#x <- x[train_ind, ]
y = as.factor(twoyear.predictors$two_year_recid)
#y <- y[train_ind]
#normalization <- preProcess(x)
#x <- predict(normalization, x)
x <- as.data.frame(x)

# Tell rfe what kind of model, how to perform cross-validation
ctrl <- rfeControl(method = "repeatedcv",
                   repeats = 10,
                   verbose = FALSE,
                   functions = lrFuncs)

# Cross-validate to find best model with each # of predictors included 
best_logistic_model = rfe(x[train_ind,], y[train_ind], sizes = seq(1,length(names(x))),
  metric = ifelse(is.factor(y[train_ind]), "Accuracy", "RMSE"),
  maximize = ifelse(is.factor(y[train_ind]), TRUE, FALSE),
  rfeControl = ctrl,
  method="glmnet")

plot(best_logistic_model, type = c("g", "o"))

best_logistic_model_1se = update(best_logistic_model,x[train_ind, ], y[train_ind],size=2)
## Warning in update.rfe(best_logistic_model, x[train_ind, ], y[train_ind], :
## The saved resamples are no longer appropriate and were removed
# Outputs
best_logistic_pred = predict(best_logistic_model_1se,newdata = x[-train_ind,])[,3]
best_logistic_cm = caret::confusionMatrix(positive = "1", best_logistic_pred, y[-train_ind])
best_logistic_prob = predict(best_logistic_model_1se,newdata = x[-train_ind,])[,2]
logistic_roc = roc(y[-train_ind], best_logistic_prob)

best_logistic_model
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 10 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.6316 0.2288    0.01750 0.03731         
##          2   0.6755 0.3338    0.01794 0.03712         
##          3   0.6793 0.3433    0.01793 0.03724         
##          4   0.6806 0.3460    0.01802 0.03714        *
##          5   0.6784 0.3430    0.01922 0.03948         
##          6   0.6768 0.3400    0.01895 0.03882         
##          7   0.6762 0.3389    0.01847 0.03779         
##          8   0.6760 0.3387    0.01800 0.03687         
##          9   0.6754 0.3376    0.01832 0.03757         
##         10   0.6754 0.3375    0.01835 0.03759         
## 
## The top 4 variables (out of 4):
##    priors_count, age, sex_Male, length_of_stay

Next, a logistic regression was developed using the lasso technique with 10-fold cross-validation. In Caret, the lasso performs its own feature selection, and tuning automatically selected alpha = 1 and lambda = 0.002844731.

The figure above shows the Recursive Feature Elimination (RFE) results for the logistic regression, plotting accuracy against the number of variables. While peak accuracy occurred at 4 variables, the model settled on 2 due to the 1-SE rule, as sketched below.
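The 2-variable choice can be reproduced from the resampling summary stored in the rfe object. The sketch below applies the 1-SE rule to caret's standard results columns (Variables, Accuracy, AccuracySD), matching the table printed above; it is an illustration rather than the exact code we originally ran.

# Sketch: 1-SE rule applied to the rfe resampling results
res <- best_logistic_model$results
best.row <- which.max(res$Accuracy)
cutoff <- res$Accuracy[best.row] - res$AccuracySD[best.row]
# smallest subset size whose CV accuracy is within one SE of the best
min(res$Variables[res$Accuracy >= cutoff])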

### Logistic Regression using Lasso (has built-in feature selection) ### 
lasso_logistic_model = caret::train(two_year_recid ~ ., trControl=train_control,method="glmnet", 
                             family="binomial", metric="Accuracy", data=twoyear.predictors[train_ind, ])

#plot(lasso_logistic_model, type = c("g", "o"))
# Should do 1-se but currently doesn't 

# Outputs
lasso_logistic_pred = predict(lasso_logistic_model,newdata = x[-train_ind, ], type = "raw")
lasso_logistic_cm = caret::confusionMatrix(positive = "1", lasso_logistic_pred, y[-train_ind])
lasso_logistic_prob = predict(lasso_logistic_model,newdata = x[-train_ind, ], type = "prob")[,2]
lasso_roc = roc(y[-train_ind], lasso_logistic_prob)
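The alpha and lambda reported above can be read back from the fitted caret object, along with the coefficients the lasso retained at that lambda; a quick inspection sketch:

# Sketch: tuning parameters chosen by caret, and coefficients at the selected lambda
lasso_logistic_model$bestTune
coef(lasso_logistic_model$finalModel, s = lasso_logistic_model$bestTune$lambda)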

Finally, an additive logistic model with spline terms was developed by first fitting a standard additive model and graphically inspecting each term for higher-order relationships. Based on this inspection, ‘age’, ‘priors_count’, and ‘length_of_stay’ were entered as 3rd- and 4th-degree spline terms in a logistic regression that included all other parameters.

### Spline Fit Logistic Regression ###
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16
spline.gam = gam(data=twoyear.predictors[train_ind, ],
                 formula = two_year_recid ~ s(age, 4) + s(priors_count, 4) + s(length_of_stay, 4),family=binomial)

# Use to pick spline terms and order
# par(mfrow = c(10,2))
# plot(spline.gam, col="ForestGreen")

spline2.gam = gam(data=twoyear.predictors[train_ind, ],
                  formula = two_year_recid ~ . + s(age, 4) + s(priors_count, 3) + s(length_of_stay, 4),family=binomial)

#Outputs
spline_pred = as.factor(as.integer(predict(spline2.gam, 
                                           x[-train_ind, ], type = "response") >= 0.5))
spline_pred_cm = caret::confusionMatrix(positive = "1", spline_pred, y[-train_ind])
spline_prob = predict(spline2.gam, x[-train_ind, ], type = "response")
spline_roc = roc(y[-train_ind], spline_prob)

KNN

A K-Nearest Neighbors model was developed using the Caret package, with 10-fold CV and all predictors included. Caret's automatic tuning selected k = 43, yielding an accuracy of about 0.67. While KNN was not ultimately our chosen model, the 1-SE rule shows that we could have used k as low as 7 with similar results.

### K-NN ###
best_knn_model = caret::train(two_year_recid ~ ., data =twoyear.predictors[train_ind, ],
                       method="knn", trControl=train_control, tuneLength = 20)

# Outputs
knn_pred = predict(best_knn_model, newdata = x[-train_ind, ], type = "raw")
knn_pred_cm = caret::confusionMatrix(positive = "1", knn_pred, y[-train_ind])
knn_prob = predict(best_knn_model, newdata = x[-train_ind, ], type = "prob")[,2]
knn_roc = roc(y[-train_ind], knn_prob)

For each of these logistic and Caret models, the ROC curves show very similar performance, and no model is fully dominant: some have better sensitivity at low cutoffs, while others do better at higher cutoffs.

Tree Methods

The first tree-based method that we decided to build is a simple pruned tree. Trees are arguably the most interpretable classification models and therefore could be very useful in the context of the justice system. Judges and defendants both would know exactly why someone is classified the way they are.

library(rpart)
library(partykit)
## Warning: package 'partykit' was built under R version 3.5.3
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 3.5.3
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.5.2
# From HW5

#tree.twoyear <- rpart(formula = two_year_recid ~ ., data = df.twoyear)
#party.twoyear <- as.party(tree.twoyear)

# Growing out large tree
twoyear.tree.full <- rpart(formula = two_year_recid ~ ., data = df.twoyear, 
                           subset = train_ind, 
                           control = rpart.control(minsplit=100, cp=0.002))
plotcp(twoyear.tree.full)

# Using one SE rule to determine what value of cp to use
One.SE <- which.min(twoyear.tree.full$cptable[, "xerror"])
One.SE.err <- twoyear.tree.full$cptable[One.SE, "xerror"] + 
  twoyear.tree.full$cptable[One.SE, "xstd"] 

(One.SE.index <- min(which(twoyear.tree.full$cptable[, "xerror"] <= One.SE.err)))
## [1] 5
(One.SE.cp <- twoyear.tree.full$cptable[One.SE.index, "CP"])
## [1] 0.003544528

First, we use the rpart library to grow out a large tree. Next, we use the 1-se rule to determine the least complex model that we can realistically use. Above is a plot of error vs cp - the dashed line represents the 1-SE error.

We used our chosen value of cp to prune back the tree, then used the partykit library to build a visualization of the pruned tree. We can see that the tree only uses priors and age to determine probabilities of recidivism.

twoyear.pruned <- prune(tree = twoyear.tree.full, cp = One.SE.cp)
twoyear.pruned.party <- as.party(twoyear.pruned)

plot(twoyear.pruned.party, gp = gpar(fontsize = 8)) 

pruned.predictions <- predict(twoyear.pruned, 
                              newdata = df.twoyear[-train_ind,], type = "class")
y <- df.twoyear[-train_ind, "two_year_recid"]
pruned_cm <- caret::confusionMatrix(positive = "1", pruned.predictions, y)
pruned.probabilities <- predict(twoyear.pruned, 
                                newdata = df.twoyear[-train_ind,])[,2]
pruned.roc <- roc(response = y,
                  predictor = pruned.probabilities)
plot(pruned.roc)

Random Forest

Next, we built a random forest model. Random forests aren't easily interpretable, but for certain problems they can perform very well. We tried two models, varying the number of variables considered at each split between 2 and 3, and plotted the out-of-bag (OOB) error against the number of trees for both. We observed that the model allowing 2 variables per split outperforms the one allowing 3, and that by around 300 trees the error has largely leveled out; we therefore chose the number of trees under 300 that minimizes OOB error.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
rf.twoyear.2 <- randomForest(two_year_recid ~ ., 
                             data = df.twoyear[train_ind,], 
                             mtry = 2 ,importance = TRUE)
rf.twoyear.3 <- randomForest(two_year_recid ~ ., 
                             data = df.twoyear[train_ind,], 
                             mtry = 3,importance = TRUE)

plot(rf.twoyear.2$err.rate[,1], type = "l", col = "indianred1",
      xlab="Number of trees", ylab="Out of Bag Error")
lines(rf.twoyear.3$err.rate[,1], col = "steelblue")
legend("topright", c("2 variables per split", "3 variables per split"),
       cex=0.8, fill=c("indianred1", "steelblue"))

# Choose 2 variables per split; use the number of trees (under 300) that minimizes OOB error
rf.twoyear <- randomForest(two_year_recid ~ ., 
                           data = df.twoyear[train_ind,], 
                           mtry = 2, 
                           ntree = which.min(rf.twoyear.2$err.rate[1:300, "OOB"]), 
                           importance = TRUE)

We use importance plots to determine which predictors are contributing the most to outcomes. With mean decrease in accuracy, we see that age and priors_count are the most impactful. With mean decrease in Gini, we see that length_of_stay also has a large impact.

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
varImpPlot(rf.twoyear)

The most impactful predictors appear to be age, priors count, and length of stay.

rf.predictions <- predict(rf.twoyear, 
                          newdata = df.twoyear[-train_ind,], 
                          type = "class")
y <- df.twoyear[-train_ind, "two_year_recid"]
rf_cm <- caret::confusionMatrix(positive = "1", rf.predictions, y)
rf.probabilities <- predict(rf.twoyear, newdata = df.twoyear[-train_ind,],
                            type = "prob")[,2]
rf.roc <- roc(response = y,
              predictor = rf.probabilities)
plot(rf.roc)

Analyzing Our Models

Underlying our positioning and framing of the risk assessment instrument was a question of bias and trade-offs. We built our assessment around the following performance metrics (a brief numerical illustration follows the list):

  • Sensitivity = TP/(TP + FN). Higher sensitivity means correctly flagging more of the individuals who actually recidivate.

  • Specificity = TN/(TN + FP). Higher specificity means correctly clearing more of the individuals who ultimately do not recidivate.

  • Positive Predictive Value = TP/(TP + FP). Higher PPV means a larger share of our "will recidivate" predictions are correct.
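As a concrete illustration of these formulas, the counts below are taken from the white-defendant confusion matrix reported later in this section; the snippet simply redoes the arithmetic and is not part of the modeling code.

# Sketch: sensitivity, specificity, and PPV from raw confusion-matrix counts
TP <- 354; FP <- 216; FN <- 468; TN <- 1065
c(Sensitivity = TP / (TP + FN),   # ~0.43
  Specificity = TN / (TN + FP),   # ~0.83
  PPV         = TP / (TP + FP))   # ~0.62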

Philosophically, the trade-offs between these performance metrics can mean the difference between wrongly sentencing a person who is unlikely to recidivate and lightly sentencing someone who is very likely to recidivate. This is a long-standing debate in law and in society.

Recognizing the prevalence of racial bias in such tools, we aimed to maximize Sensitivity, Specificity, and PPV, with priority on Specificity and PPV. We plotted the ROC curves of our models to look for clear winners in terms of AUC. Unfortunately, most of our models performed similarly well; while it was possible to weed out models like Naive Bayes and the stepwise LDA, we observed no clear-cut victors.

ggroc(list(NB = nb_roc, LDA = lda_roc, LDA_Step = lda_stepwise_roc, caret_best = logistic_roc, lasso = lasso_roc, spline = spline_roc, knn = knn_roc, pruned_tree=pruned.roc, rf=rf.roc), ) + theme_light() + ggtitle("ROC Curves") 

Next, to examine each performance metric, we generated a dataframe and bar graph to map out how each model performed.

Building a dataframe

performance.metrics <- data.frame("Best Logistic" = best_logistic_cm$byClass, "KNN" = knn_pred_cm$byClass, 
                                  "Lasso" = lasso_logistic_cm$byClass, "LDA" = lda_cm$byClass, 
                                  "LDA Step" = lda_stepwise_cm$byClass, "NB" = nb_cm$byClass, "Pruned" = pruned_cm$byClass,
                                  "RF" = rf_cm$byClass, "Splines" = spline_pred_cm$byClass)
# AUC values, in the same column order as the models above
performance.metrics <- rbind(performance.metrics, c(logistic_roc$auc, knn_roc$auc, lasso_roc$auc, lda_roc$auc, 
                                                    lda_stepwise_roc$auc, nb_roc$auc, pruned.roc$auc,
                                                    rf.roc$auc, spline_roc$auc))
performance.metrics <- t(performance.metrics)
colnames(performance.metrics)[12] <- "AUC"
performance.metrics <- as.data.frame(performance.metrics)
performance.metrics <- performance.metrics %>% dplyr::select("AUC", everything())
performance.metrics <- cbind(performance.metrics, rownames(performance.metrics))
colnames(performance.metrics)[13] <- "Model"

Visualizing Results

perf.sens <- performance.metrics %>% dplyr::select("Sensitivity", "Model") %>% dplyr::mutate(score = rep("Sensitivity", 9))
perf.spec <- performance.metrics %>% dplyr::select("Specificity", "Model") %>% dplyr::mutate(score = rep("Specificity", 9))
perf.PPV <- performance.metrics %>% dplyr::select("Pos Pred Value", "Model") %>% dplyr::mutate(score = rep("PPV", 9))
colnames(perf.sens) <- c("Score", "Model", "Method")
colnames(perf.spec) <- c("Score", "Model", "Method")
colnames(perf.PPV) <- c("Score", "Model", "Method")

tradeoff.data <- data.frame(rbind(perf.sens, perf.spec, perf.PPV))

# Run for violent and save plot
ggplot(data = tradeoff.data) +
  geom_bar(mapping = aes(x = Model, y = Score, fill = Method), 
           position = "dodge", stat = "identity") +
  ylim(c(0,1))+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))+
  ggtitle("Model Metrics")

Best Model: General Recidivism

Upon reviewing these results, we observed that the pruned tree model performed admirably on many key metrics. It scored highest on Sensitivity (0.658) while maintaining a competitive PPV (0.68). Thus, it became our strongest model for two-year recidivism. We also valued the tree for its interpretability, which inherently makes the risk assessment feel more fair.

We ran the same plots for violent recidivism:

load("./violent_ROC_plot.RData")
violent_ROC_plot

And produced similar visualizations:

#current = getwd()
library(caret)
library(dplyr)
load(file = "C:/Users/bjrhi/OneDrive/Documents/R/DataMining/final_submission/performance.metrics.violent.RData")

performance.metrics.violent <- cbind(performance.metrics.violent, rownames(performance.metrics.violent))
performance.metrics.violent <- as.data.frame(performance.metrics.violent)
colnames(performance.metrics.violent)[12] <- "Model"
perf.sens.v <- performance.metrics.violent %>% dplyr::select("Sensitivity", "Model") %>% dplyr::mutate(score = rep("Sensitivity", 9))
perf.spec.v <- performance.metrics.violent %>% dplyr::select("Specificity", "Model") %>% dplyr::mutate(score = rep("Specificity", 9))
perf.PPV.v <- performance.metrics.violent %>% dplyr::select("Pos Pred Value", "Model") %>% dplyr::mutate(score = rep("PPV", 9))
colnames(perf.sens.v) <- c("Score", "Model", "Method")
colnames(perf.spec.v) <- c("Score", "Model", "Method")
colnames(perf.PPV.v) <- c("Score", "Model", "Method")

tradeoff.data.v <- data.frame(rbind(perf.sens.v, perf.spec.v, perf.PPV.v))
tradeoff.data.v$Score <- as.numeric(levels(tradeoff.data.v$Score))[tradeoff.data.v$Score]
tradeoff.data.v$Model <- as.factor(tradeoff.data.v$Model)
tradeoff.data.v$Method <- as.factor(tradeoff.data.v$Method)

ggplot(data = tradeoff.data.v) +
  geom_bar(mapping = aes(x = Model, y = Score, fill = Method), 
           position = "dodge", stat = "identity") +
  ylim(c(0,1))+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))+
  ggtitle("Violent Model Metrics")

Best Model: Violent Recidivism

Out of 4,015 observations in the violent recidivism dataset, the prevalence of violent recidivism is only 4.4%; we are therefore working with a dataset composed mostly of non-recidivators. For this reason, at the 0.5 cutoff we used as our default for model comparison, most models have very high Specificity and low Sensitivity. This is a fair bet, but it means we are likely to under-identify those who will violently recidivate. (Note: while ProPublica used a variant of the full two-year dataset, we ran into data-integrity issues during cleaning and decided to use the compas-scores-two-years-violent.csv dataset, which has fewer observations.)

Despite this limitation, we found that the LDA model offered the best compromise, producing reasonably high-quality predictions with a 0.42 PPV and a pack-leading 0.18 Sensitivity.
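Because the 0.5 cutoff is what drives the low sensitivity discussed above, a natural follow-up is to lower the cutoff on the posterior probability and accept more false positives. The violent pipeline was run separately and loaded from .RData, so the sketch below illustrates the mechanism on the general-recidivism LDA objects from earlier; the 0.3 cutoff is an arbitrary illustrative choice.

# Sketch: trading specificity for sensitivity by lowering the classification cutoff
post <- predict(lda_fit, test)$posterior[, "1"]
pred.lowcut <- factor(as.integer(post >= 0.3), levels = c(0, 1))
caret::confusionMatrix(pred.lowcut, test$two_year_recid, positive = "1")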

Between Model Comparisons

Assessing Final Models: General Recidivism

Once we selected the pruned tree model for general recidivism and LDA for violent recidivism, we used 10-fold cross-validation to test each data point against both models. The data were randomly divided into 10 folds, and 10 models (each trained on 90% of the data and tested on the held-out 10%) were constructed and evaluated. This allowed us to assess accuracy without the optimism that comes from training and testing on the same data.

## Warning: package 'partykit' was built under R version 3.5.3
## Warning: package 'libcoin' was built under R version 3.5.3
## Warning: package 'mvtnorm' was built under R version 3.5.2
big.tree <- rpart(formula = two_year_recid ~ ., data = df.twoyear,
                       control = rpart.control(minsplit = 100, cp = 0.002))
    
# One.SE.cp is the cp value chosen earlier via the 1-SE rule
best.pruned.tree <- prune(tree = big.tree, cp = One.SE.cp)

best.tree.party <- as.party(best.pruned.tree)
plot(best.tree.party, gp = gpar(fontsize = 8)) 

Our final pruned tree uses three variables to predict two-year recidivism: priors_count, age, and sex. In general, having more priors, being younger, and being male all make you more likely to recidivate. If someone has more than 2.5 priors and is under 33.5 years old, they have about a 70% chance of recidivating within two years.
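To make the tree's logic concrete, a single hypothetical defendant can be pushed through the pruned tree. The values below are invented for illustration; starting from an existing row of df.twoyear keeps all factor levels valid.

# Sketch: predicted recidivism probability for one hypothetical defendant
hypothetical <- df.twoyear[1, ]
hypothetical$age <- 24            # below the 33.5 split
hypothetical$priors_count <- 4    # above the 2.5 split
hypothetical$sex <- "Male"
predict(best.pruned.tree, newdata = hypothetical, type = "prob")

Given the splits described above, this defendant should land in the roughly 70% node.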

Assessing Final Models: Violent Recidivism

The most influential predictors in the LDA were the number of priors, race, age, and sex. The coefficients below compare each race to a baseline of African-American.

library(caret)
## Loading required package: lattice
library(klaR)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
best.model.v <- lda_fit

classificationCrossValidation <- function(df.twoyear.violent) {

  # Standardization 
  standardized_params = preProcess(df.twoyear.violent, method=c("center","scale"))
  standardized_data = predict(standardized_params,df.twoyear.violent)
  
  # 10-fold Cross-Validation to test for Classification Error 
  random_indecies = sample(1:(nrow(df.twoyear.violent)))
  start = 1
  fold_size = floor(nrow(df.twoyear.violent)/10)
  final_2y_scores = data.frame()
  final_2y_predics = c()
  y = c()
  indics = data.frame()
  
  for(i in seq(1,10)){
    
    if(i==10){
      range = seq(start + (i-1)*fold_size,nrow(df.twoyear.violent))
    } else{
      range = seq(start + (i-1)*fold_size,i*fold_size)
    }
   
    test_vector = random_indecies[range]
    
    fold_train = standardized_data[-test_vector,]
    fold_test = standardized_data[test_vector,]
    
    lda_fit <- lda(two_year_recid ~., data=fold_train)
    
    x <- as.data.frame(fold_test %>% dplyr::select(-c("two_year_recid")))
    y <- c(y, fold_test$two_year_recid)
    indics <- rbind(indics,df.twoyear.violent[test_vector,])
    
    # put final_2y_predics, y, and indics into a data frame together 
    
    final_2y_scores <- rbind(final_2y_scores,data.frame(predict(lda_fit, x, type = "response")))
    final_2y_predics <- as.factor(as.integer(final_2y_scores$posterior.1 >= 0.5))
  }
  
  # c() on factors returns integer codes (1/2), so subtract 1 to recover 0/1 labels
  y = as.factor(y-1)
  
  return(data.frame(predics = final_2y_predics, y = y, indics))
  
}

load(file="./df.twoyear.violent.RData")

# use on violent data set 
crossVals = classificationCrossValidation(df.twoyear.violent)
# Confusion Matrix for Cross-Validated Model 
best.cm.violent <- caret::confusionMatrix(positive = "1", crossVals$predics, crossVals$y)
##                             LD1
## priors_count         0.81362405
## raceAsian           -0.75541263
## age                 -0.61095683
## raceNative American -0.46139721
## sexMale              0.43324632
## raceOther           -0.32987988
## raceHispanic        -0.32162898
## c_charge_degreeM    -0.26612389
## length_of_stay       0.13496816
## raceCaucasian       -0.08863264

Comparing Results to COMPAS

With our best models selected, we began to compare our results to COMPAS. Here is how COMPAS performed, according to ProPublica's research. For all defendants, this is what they observed:

  • Total: 7214.00
  • False positive rate: 32.35
  • False negative rate: 37.40
  • Specificity: 0.68
  • Sensitivity: 0.63
  • Prevalence: 0.45
  • PPV: 0.61
  • NPV: 0.69

Our general recidivism model achieved an overall accuracy of 68.4%. Compared to COMPAS, it was roughly 8 points better (lower) on the overall False Positive Rate and about 5 points worse (higher) on the False Negative Rate. From our perspective of reducing bias, so far, so good, but at the price of failing to identify more true recidivators.

## [1] "COMPAS False Positive Rate: 32.35 | Our models False Positive Rate: 24.03"
## [1] "COMPAS False Positive Rate: 37.40 | Our models False Negative Rate: 42.72"
## [1] "--------------------------- "
## [1] "Violent False Positive Rate: 0.22 | Our models False Negative Rate: 99.28"

Our violent recidivism model, on the other hand, was actually more accurate: 83.9%. Why? With a smaller dataset and a much lower prevalence, the LDA was far more likely to assign the negative class, resulting in an 86.04% False Negative Rate and just a 2.5% False Positive Rate. Compare this to COMPAS's 27.93% False Positive Rate and 47.15% False Negative Rate.
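For reference, the false positive and false negative rates quoted in these comparisons can be derived from any of the caret confusion matrices. The helper below (rates_from_cm is just a convenience function for illustration, not part of the modeling code) sketches that calculation.

# Sketch: FPR and FNR (in percent) from a caret::confusionMatrix object
rates_from_cm <- function(cm) {
  tab <- cm$table                      # rows = predictions, columns = reference
  c(FPR = tab["1", "0"] / sum(tab[, "0"]) * 100,
    FNR = tab["0", "1"] / sum(tab[, "1"]) * 100)
}
rates_from_cm(best.cm.violent)         # the cross-validated violent-recidivism model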

## [1] "White Confusion Matrix"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1065  468
##          1  216  354
##                                           
##                Accuracy : 0.6748          
##                  95% CI : (0.6543, 0.6948)
##     No Information Rate : 0.6091          
##     P-Value [Acc > NIR] : 2.553e-10       
##                                           
##                   Kappa : 0.2773          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4307          
##             Specificity : 0.8314          
##          Pos Pred Value : 0.6211          
##          Neg Pred Value : 0.6947          
##              Prevalence : 0.3909          
##          Detection Rate : 0.1683          
##    Detection Prevalence : 0.2710          
##       Balanced Accuracy : 0.6310          
##                                           
##        'Positive' Class : 1               
## 
## [1] "White Confusion-Violence Matrix"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1916  186
##          1    1    0
##                                           
##                Accuracy : 0.9111          
##                  95% CI : (0.8981, 0.9229)
##     No Information Rate : 0.9116          
##     P-Value [Acc > NIR] : 0.55            
##                                           
##                   Kappa : -0.0009         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000000       
##             Specificity : 0.9994784       
##          Pos Pred Value : 0.0000000       
##          Neg Pred Value : 0.9115128       
##              Prevalence : 0.0884451       
##          Detection Rate : 0.0000000       
##    Detection Prevalence : 0.0004755       
##       Balanced Accuracy : 0.4997392       
##                                           
##        'Positive' Class : 1               
## 
## [1] "African American Confusion Matrix"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1014  565
##          1  500 1096
##                                          
##                Accuracy : 0.6646         
##                  95% CI : (0.6478, 0.681)
##     No Information Rate : 0.5231         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.329          
##                                          
##  Mcnemar's Test P-Value : 0.04986        
##                                          
##             Sensitivity : 0.6598         
##             Specificity : 0.6697         
##          Pos Pred Value : 0.6867         
##          Neg Pred Value : 0.6422         
##              Prevalence : 0.5231         
##          Detection Rate : 0.3452         
##    Detection Prevalence : 0.5027         
##       Balanced Accuracy : 0.6648         
##                                          
##        'Positive' Class : 1              
## 
## [1] "African American-Violent Confusion Matrix"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2740  421
##          1    9    5
##                                           
##                Accuracy : 0.8646          
##                  95% CI : (0.8522, 0.8763)
##     No Information Rate : 0.8658          
##     P-Value [Acc > NIR] : 0.595           
##                                           
##                   Kappa : 0.0143          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.011737        
##             Specificity : 0.996726        
##          Pos Pred Value : 0.357143        
##          Neg Pred Value : 0.866814        
##              Prevalence : 0.134173        
##          Detection Rate : 0.001575        
##    Detection Prevalence : 0.004409        
##       Balanced Accuracy : 0.504232        
##                                           
##        'Positive' Class : 1               
## 
## [1] "Caucasian False Positive Rate: 16.86"
## [1] "African American False Positive Rate: 33.03"
## [1] "Caucasian False Negative Rate: 56.93"
## [1] "African American False Negative Rate: 34.02"

Our general recidivism model has a higher False Positive Rate for African American individuals than for white individuals (33.03% vs. 16.86%), and a higher False Negative Rate for white individuals (56.93% vs. 34.02%). In other words, the model does not treat the two groups equally.

## [1] "Violent Caucasian False Positive Rate: 0.05"
## [1] "Violent African American False Positive Rate: 0.33"
## [1] "Violent Caucasian False Negative Rate: 100"
## [1] "Violent African American False Negative Rate: 98.83"

Our violent recidivism model likewise shows a higher False Positive Rate and a lower False Negative Rate for African Americans than for Caucasians. Although the absolute gaps are small, because the model rarely predicts violent recidivism at all, the direction of the disparity still constitutes racial bias.
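
The race-specific rates above, and the comparisons below, can be reproduced by subsetting the test data before computing each rate. A sketch follows, where `test.df`, `race`, `pred`, and `two_year_recid` are illustrative names rather than the exact objects used earlier:

# Hypothetical sketch: FPR and FNR for a single racial group.
group.rates <- function(df, group) {
  sub <- df[df$race == group, ]
  fp <- sum(sub$pred == 1 & sub$two_year_recid == 0)
  tn <- sum(sub$pred == 0 & sub$two_year_recid == 0)
  fn <- sum(sub$pred == 0 & sub$two_year_recid == 1)
  tp <- sum(sub$pred == 1 & sub$two_year_recid == 1)
  c(FPR = round(100 * fp / (fp + tn), 2),
    FNR = round(100 * fn / (fn + tp), 2))
}

group.rates(test.df, "Caucasian")
group.rates(test.df, "African-American")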

## [1] "COMPAS Black False Positive Rate: 44.85 | Our model's Black False Positive Rate: 33.03"
## [1] "COMPAS Black False Negative Rate: 27.99 | Our model's Black False Negative Rate: 34.02"
## [1] "COMPAS Black False Sensitivity: 0.72 | Our model's Black Sensitivity: 0.66"
## [1] "COMPAS Black False Specificity: 0.55 | Our model's Black Specificity: 0.67"
## [1] "----- RESULTS FOR VIOLENT RECIDIVISM -----"
## [1] "COMPAS Black False Positive Rate: 44.85 | Our model's Black False Positive Rate: 0.33"
## [1] "COMPAS Black False Negative Rate: 27.99 | Our model's Black False Negative Rate: 98.83"
## [1] "COMPAS Black False Sensitivity: 0.72 | Our model's Black Sensitivity: 0.01"
## [1] "COMPAS Black False Specificity: 0.55 | Our model's Black Specificity: 1"

Model performance for Sex and Age:

As shown by these models, under our tool, males have a higher Sensitivity than females but a lower Specificity. As for age, Sensitivity decreases from young to old, likely because there are few true positives among older defendants, while Specificity increases, since the model becomes more likely to correctly classify non-recidivists.

As shown before, our violent tool is systematically low in Sensitivity and high in Specificity, and the age and sex breakdowns take this to the extreme: every subgroup's False Negative Rate is roughly 99% or higher, with only the oldest group dipping just below 99%.
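
The sex and age breakdowns below follow the same pattern: split the test set into groups and compute a confusion matrix for each one. The sketch below assumes the matrices come from caret::confusionMatrix() (as their printed format suggests) and reuses the illustrative `test.df` from above; the age cut points are assumptions, not necessarily the ones used in our analysis.

library(caret)

# Illustrative age bins; the actual cut points used in the report may differ.
test.df$age_group <- cut(test.df$age, breaks = c(0, 25, 45, Inf),
                         labels = c("Young", "Adult", "Old"))

for (g in levels(test.df$age_group)) {
  sub <- test.df[test.df$age_group == g, ]
  print(paste("-----", toupper(g), "-----"))
  print(confusionMatrix(factor(sub$pred, levels = c(0, 1)),
                        factor(sub$two_year_recid, levels = c(0, 1)),
                        positive = "1"))
}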

## [1] "----- RESULTS FOR  SEX AND AGE -----"
## [1] "----- MALE  -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1891  949
##          1  710 1447
##                                           
##                Accuracy : 0.668           
##                  95% CI : (0.6547, 0.6811)
##     No Information Rate : 0.5205          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3323          
##                                           
##  Mcnemar's Test P-Value : 5.12e-09        
##                                           
##             Sensitivity : 0.6039          
##             Specificity : 0.7270          
##          Pos Pred Value : 0.6708          
##          Neg Pred Value : 0.6658          
##              Prevalence : 0.4795          
##          Detection Rate : 0.2896          
##    Detection Prevalence : 0.4317          
##       Balanced Accuracy : 0.6655          
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- FEMALE  -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 664 251
##          1  98 162
##                                          
##                Accuracy : 0.703          
##                  95% CI : (0.6759, 0.729)
##     No Information Rate : 0.6485         
##     P-Value [Acc > NIR] : 4.280e-05      
##                                          
##                   Kappa : 0.2881         
##                                          
##  Mcnemar's Test P-Value : 4.073e-16      
##                                          
##             Sensitivity : 0.3923         
##             Specificity : 0.8714         
##          Pos Pred Value : 0.6231         
##          Neg Pred Value : 0.7257         
##              Prevalence : 0.3515         
##          Detection Rate : 0.1379         
##    Detection Prevalence : 0.2213         
##       Balanced Accuracy : 0.6318         
##                                          
##        'Positive' Class : 1              
## 
## [1] "----- YOUNG  -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 305 207
##          1 288 547
##                                           
##                Accuracy : 0.6325          
##                  95% CI : (0.6061, 0.6583)
##     No Information Rate : 0.5598          
##     P-Value [Acc > NIR] : 0.00000003489   
##                                           
##                   Kappa : 0.2433          
##                                           
##  Mcnemar's Test P-Value : 0.0003235       
##                                           
##             Sensitivity : 0.7255          
##             Specificity : 0.5143          
##          Pos Pred Value : 0.6551          
##          Neg Pred Value : 0.5957          
##              Prevalence : 0.5598          
##          Detection Rate : 0.4061          
##    Detection Prevalence : 0.6199          
##       Balanced Accuracy : 0.6199          
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- ADULT  -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1479  747
##          1  412  894
##                                           
##                Accuracy : 0.6719          
##                  95% CI : (0.6561, 0.6873)
##     No Information Rate : 0.5354          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3314          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5448          
##             Specificity : 0.7821          
##          Pos Pred Value : 0.6845          
##          Neg Pred Value : 0.6644          
##              Prevalence : 0.4646          
##          Detection Rate : 0.2531          
##    Detection Prevalence : 0.3698          
##       Balanced Accuracy : 0.6635          
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- OLD  -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 771 246
##          1 108 168
##                                          
##                Accuracy : 0.7262         
##                  95% CI : (0.701, 0.7504)
##     No Information Rate : 0.6798         
##     P-Value [Acc > NIR] : 0.0001638      
##                                          
##                   Kappa : 0.3103         
##                                          
##  Mcnemar's Test P-Value : 3.302e-13      
##                                          
##             Sensitivity : 0.4058         
##             Specificity : 0.8771         
##          Pos Pred Value : 0.6087         
##          Neg Pred Value : 0.7581         
##              Prevalence : 0.3202         
##          Detection Rate : 0.1299         
##    Detection Prevalence : 0.2135         
##       Balanced Accuracy : 0.6415         
##                                          
##        'Positive' Class : 1              
## 
## [1] "----- RESULTS FOR VIOLENT SEX AND AGE -----"
## [1] "----- MALE VIOLENT -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4375  605
##          1   12    5
##                                           
##                Accuracy : 0.8765          
##                  95% CI : (0.8671, 0.8855)
##     No Information Rate : 0.8779          
##     P-Value [Acc > NIR] : 0.6289          
##                                           
##                   Kappa : 0.0094          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.008197        
##             Specificity : 0.997265        
##          Pos Pred Value : 0.294118        
##          Neg Pred Value : 0.878514        
##              Prevalence : 0.122073        
##          Detection Rate : 0.001001        
##    Detection Prevalence : 0.003402        
##       Balanced Accuracy : 0.502731        
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- FEMALE VIOLENT -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1093   82
##          1    0    0
##                                           
##                Accuracy : 0.9302          
##                  95% CI : (0.9141, 0.9441)
##     No Information Rate : 0.9302          
##     P-Value [Acc > NIR] : 0.5293          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.00000         
##             Specificity : 1.00000         
##          Pos Pred Value :     NaN         
##          Neg Pred Value : 0.93021         
##              Prevalence : 0.06979         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.00000         
##       Balanced Accuracy : 0.50000         
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- YOUNG VIOLENT -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1161  185
##          1    1    0
##                                           
##                Accuracy : 0.8619          
##                  95% CI : (0.8423, 0.8799)
##     No Information Rate : 0.8627          
##     P-Value [Acc > NIR] : 0.551           
##                                           
##                   Kappa : -0.0015         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0000000       
##             Specificity : 0.9991394       
##          Pos Pred Value : 0.0000000       
##          Neg Pred Value : 0.8625557       
##              Prevalence : 0.1373422       
##          Detection Rate : 0.0000000       
##    Detection Prevalence : 0.0007424       
##       Balanced Accuracy : 0.4995697       
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- ADULT VIOLENT -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3099  421
##          1    8    4
##                                           
##                Accuracy : 0.8785          
##                  95% CI : (0.8673, 0.8891)
##     No Information Rate : 0.8797          
##     P-Value [Acc > NIR] : 0.5944          
##                                           
##                   Kappa : 0.0118          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.009412        
##             Specificity : 0.997425        
##          Pos Pred Value : 0.333333        
##          Neg Pred Value : 0.880398        
##              Prevalence : 0.120328        
##          Detection Rate : 0.001133        
##    Detection Prevalence : 0.003398        
##       Balanced Accuracy : 0.503418        
##                                           
##        'Positive' Class : 1               
## 
## [1] "----- OLD VIOLENT -----"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1208   81
##          1    3    1
##                                           
##                Accuracy : 0.935           
##                  95% CI : (0.9202, 0.9479)
##     No Information Rate : 0.9366          
##     P-Value [Acc > NIR] : 0.6181          
##                                           
##                   Kappa : 0.0175          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0121951       
##             Specificity : 0.9975227       
##          Pos Pred Value : 0.2500000       
##          Neg Pred Value : 0.9371606       
##              Prevalence : 0.0634184       
##          Detection Rate : 0.0007734       
##    Detection Prevalence : 0.0030936       
##       Balanced Accuracy : 0.5048589       
##                                           
##        'Positive' Class : 1               
## 

Closing Words

Ultimately, our models capture slightly fewer True Positives than the COMPAS tool but generally produce fewer False Positives. In addition, our models are consistent with the wealth of criminal justice research: age, sex, and number of priors are consistently associated with the likelihood of recidivating (Richard Berk, Criminal Justice Forecasts of Risk (New York: Springer, 2012), 30.)

Our analysis reveals that these tools can sometimes adequately predict who is likely to recidivate, but such predictions must be accompanied by deliberate attention to fairness to ensure vulnerable populations are not disproportionately targeted.